Abstract

Learning and decision making in the brain are key processes critical to survival, and yet are
processes implemented by non-ideal biological building blocks which can impose significant
error. We explore quantitatively how the brain might cope with this inherent source of error by
taking advantage of two ubiquitous mechanisms, redundancy and synchronization. In particular
we consider a neural process whose goal is to learn a decision function by implementing a
nonlinear gradient dynamics. The dynamics, however, are assumed to be corrupted by perturbations
modeling the error which might be incurred due to limitations of the biology, intrinsic
neuronal noise, and imperfect measurements. We show that error, and the associated uncertainty
surrounding a learned solution, can be controlled in large part by trading off synchronization
strength among multiple redundant neural systems against the noise amplitude. The impact
of the coupling between such redundant systems is quantified by the spectrum of the network
Laplacian, and we discuss the role of network topology in synchronization and in reducing the
effect of noise. A range of situations in which the mechanisms we model arise in brain science
are discussed, and we draw attention to experimental evidence suggesting that cortical circuits
capable of implementing the computations of interest here can be found on several scales. Finally,
simulations comparing theoretical bounds to the relevant empirical quantities show that
the theoretical estimates we derive can be tight.



1 Introduction 

Learning and decision making in the brain are key processes critical to survival, and yet are
processes implemented by imperfect biological building blocks which can impose significant error. We
suggest that the brain can cope with this inherent source of error by taking advantage of two
ubiquitous mechanisms: redundancy, and sharing of information. These concepts will be made precise in
the context of a specific model and learning scenario which together can serve as a conceptual tool
for illustrating the effect of redundancy and sharing.

Motivated by the problem of learning to discriminate, we consider a neural process whose goal
is to learn a decision function by implementing a nonlinear gradient dynamics. The dynamics,
however, are assumed to be corrupted by perturbations modeling the error which might be incurred.
This general perspective is intended to capture a range of possible learning instances occurring at
different anatomical scales: The neural process can involve whole brain areas communicating via
behavioral, motor or sensory pathways (Schnitzler and Gross, 2005), as in the case of the multiple
amygdala-thalamus loops assumed to underpin fear conditioning for instance (LeDoux, 2000;
Maren, 2001). Interacting local field potentials (LFPs) may also be modeled, as either direct (by long
range phase-locking, e.g. in olfactory systems (Friedrich et al., 2004)) or indirect measurements of
coordination and interaction among large assemblies of neurons. The learning dynamics may
alternatively model smaller ensembles of individual neurons, as in primary motor cortex, though we do
not emphasize biological realism in our models at this scale. Nevertheless, one may still draw useful
conclusions as to the role of redundancy and information sharing. The error too may be treated at
different scales, and may take the form of noise intrinsic to the neural environment (Faisal et al.,
2008) on a large, aggregate scale (e.g. in the case of LFPs) or on a small scale involving localized
populations of neurons.

If there is noise corrupting the learning process, an immediate question is whether it is possible
to gauge the accuracy of the predictions of the learned function, and to what extent the organism can
reduce uncertainty in its decisions by taking advantage of a simple, common information sharing
mechanism. If there is redundancy in the form of multiple independent copies of the dynamical
circuit (Adams, 1998; Fernando et al., 2010), it is reasonable to expect that averaging over the different
solutions might reduce noise via cancelation effects. In the case of learning in the brain, however,
this approach is problematic because neurons are susceptible to saturation of their firing rates, and
on large scales aggregate signal amplitudes will also saturate; the macroscopic dynamics that neuron
populations and assemblies obey can be strongly nonlinear. When the dynamics followed by
different dynamical systems are nonlinear, one cannot expect to gain a meaningful signal by linear
averaging (see e.g. (Tabareau et al., 2010) and examples therein). As a simple illustration of
this phenomenon, consider a collection of noisy sinusoidal oscillators allowed to run starting from
different initial conditions, with identical frequencies and independent noise terms. The oscillators
will be out of phase from each other, so an average over the trajectories will not yield anything
close to a clean version of a sinusoid at the desired frequency. On the other hand it is reasonable to
suppose that synchronization across neuron populations or between macro-scale cortical loops may
provide sufficient phase alignment to make linear averaging, and thus "consensus", a powerful
proposal for reducing the effects of noise (Tabareau et al., 2010; Masuda et al., 2010; Cao et al., 2010;
Young et al., 2010; Poulakakis et al., 2010; Gigante et al., 2009). Indeed, it is a well known fact
that synchrony within a system of coupled dynamical elements provides (quantifiable) robustness to
perturbations occurring in any one element's dynamics (Needleman et al., 2001; Wang and Slotine,
2005; Pham and Slotine, 2007).
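The oscillator illustration above can be sketched numerically. The following is a minimal sketch, not part of the original analysis: the number of oscillators, frequency and noise level are arbitrary assumptions, and for simplicity the noise is added to the observed signal rather than to the oscillator dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
n_osc = 200                        # number of oscillators (illustrative choice)
t = np.linspace(0.0, 10.0, 2001)   # time grid
omega = 2.0 * np.pi                # common frequency (assumed)
noise_sd = 0.1                     # additive noise level (assumed)

def mean_amplitude(phases):
    """Peak absolute value of the linear average over noisy sinusoids."""
    signals = np.sin(omega * t[None, :] + phases[:, None])
    signals += noise_sd * rng.standard_normal(signals.shape)
    return np.max(np.abs(signals.mean(axis=0)))

# Random initial phases: the oscillators are out of phase with each other.
unsynced = mean_amplitude(rng.uniform(0.0, 2.0 * np.pi, n_osc))
# Aligned phases: a stand-in for perfectly synchronized oscillators.
synced = mean_amplitude(np.zeros(n_osc))

print(f"average of unsynchronized oscillators, peak amplitude: {unsynced:.3f}")
print(f"average of synchronized oscillators, peak amplitude:   {synced:.3f}")
```

With random phases the average nearly cancels, whereas with aligned phases the average recovers a sinusoid of roughly unit amplitude with reduced noise.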



We will place much emphasis on exploring quantitatively the role of synchrony in controlling
uncertainty arising from noise modeling neural error. In particular, we base our work on the
argument that noisy, nonlinear trajectories can be linearly averaged if fluctuations due to noise can
be made small and that fluctuations can be made small by coupling the dynamical elements
appropriately. In the stochastic setting adopted here, "synchronization" refers to state synchrony: the
tendency for individual elements' trajectories to move towards a common trajectory, in a
quantifiable sense. The estimates we present directly characterize the tradeoff between the network's
tendency towards synchrony and the noise, and ultimately address the specific role this tradeoff
plays in determining uncertainty surrounding a function learned by an imperfect learning system.

We further show how and where the topology of the network of neural ensembles impacts the
extent to which the noise, and therefore uncertainty, can be controlled. The estimates we provide
take into account in a fundamental way both the nonlinearity in the dynamics and the noise. More
generally, the work discussed here also has implications in other related domains, such as networks
of coupled learners or adaptive sensor networks, and can be extended to multitask online or
dynamic learning settings. The difficulty inherent in analyzing dynamic learning systems, such as
hierarchies with feedback (Mumford, 1992; Lee and Mumford, 2003), poses a challenge. But
considering dynamic systems can yield substantial benefits: transients can be important, as suggested
by the literature on regularization paths and early stopping (Yao et al., 2007). Furthermore, the role
of feedback/backprojections and attention-like mechanisms in learning and recognition systems,
both biological and artificial, is known to be important but is not well understood (Hahnloser et al.,
1999; Itti and Koch, 2001; Hung et al., 2005).



The paper is organized as follows. In Section 2 we consider a specific learning problem and
define a system of stochastic differential equations (SDEs) modeling a simple dynamic learning
process. We then discuss stability and network topology in the context of synchronization. In
Section 3 we present the main theoretical results of the paper, a set of uncertainty estimates, postponing
proofs until later. Then in Section 5 we provide simulations and compare empirical estimates to
the theoretical quantities predicted by the Theorems in Section 3. Section 4 provides a discussion
addressing the significance and applicability of our theoretical contributions to neuroscience and
behavior. Finally, in Section 6 we give proofs of the results stated in Section 3.



2 Biological Learning as a Stochastic Network Model 

The learning process we will model is that of a one-dimensional linear fitting problem described by
gradient based minimization of a square loss objective, in the spirit of Rao and Ballard (Rao and Ballard,
1999). This is perhaps the simplest and most fundamental abstract learning problem that an
organism might be confronted with - that of using experiential evidence to infer correlations and
ultimately discover causal relationships which govern the environment and which can be used to make
predictions about the future. The model realizing this learning process is also simple, in that we
capture neural communication as an abstract process "in which a neural element (a single neuron
or a population of neurons) conveys certain aspects of its functional state to another neural
element" (Schnitzler and Gross, 2005). In doing so, we focus on the underlying computations taking
place in the nervous system, rather than dwell on neural representations. Even this simple setting
becomes involved technically, and is rich enough to explore all of the key themes discussed above.
Our model also supports nonlinear decision functions in the sense that we might consider taking a
linear function of nonlinear variables whose values might be computed upstream. In this case the
development would be similar, but extended to the multidimensional case. The model may also be
extended to richer function classes and more exotic loss functions directly; however, for our
purposes the additional generality does not yield significant further insight and furthermore might raise
biological plausibility concerns^1.



2.1 Problem Setup 

To make the setting more concrete, we begin by assuming that we have observed a set of
input-output examples $\{x_i \in \mathbb{R}, y_i \in \mathbb{R}\}_{i=1}^m$, each representing a generic unit of sensory experience, and
want to estimate the linear regression function $f_w(x) = wx$. Adopting the square loss, the total
error incurred on the observations by $f_w$ is given by the familiar expression

$$E(w) = \sum_{i=1}^m \big(y_i - f_w(x_i)\big)^2 = \sum_{i=1}^m (y_i - wx_i)^2.$$

^1 In the sense that one would have to carefully justify biologically the particular nonlinearities going into a nonlinear
decision function on a case-by-case basis.
We will model adaptation (training) by a noisy gradient descent process, with biologically plausible
dynamics, on this squared prediction error loss function. The trajectory of the slope parameter
over time $w(t)$ and its governing dynamics may be represented in the biology in various forms.
Stochastic rate codes, average activities in populations of neurons and population codes, localized
direct electrical signals and chemical concentration gradients are some possibilities occurring across
a range of scales. The dynamical system may also be interpreted as modeling the noisy, time-varying
strength of a local field potential or other macro electrophysiological signal when there are multiple,
interacting brain regions. We discuss these possibilities further in Section 4.

The gradient of $E$ with respect to the weight parameter is given (up to a constant factor) by
$\nabla_w E = -\sum_{i=1}^m (y_i - wx_i)x_i$, and serves as the starting point. The gradient dynamics
$\dot{w} = -\nabla_w E(w)$ are both linear and noise-free. Following the discussion above, we modify these
dynamics to capture nonlinear saturation effects as well as (often substantial) noise modeling error.
Saturation effects lead to a saturated gradient which we model in the form of the hyperbolic tangent
nonlinearity,

$$\dot{w} = -\tanh\big(\alpha \nabla_w E(w)\big),$$

where $\alpha$ is a slope parameter. Note that the saturated dynamics need not be interpretable as itself
the gradient of an appropriate loss function. The fundamental learning problem is defined by the
square-loss, but it is implemented using an imperfect mechanism which imposes the nonlinearity^2.
The error is modeled with an additional diffusion (noise) term giving the SDE

$$dw_t = -\tanh\big(\alpha \nabla_w E(w_t)\big)\,dt + \sigma\,dB_t, \qquad (1)$$

where $dB_t$ denotes the standard 1-dimensional Wiener increment process, scaled by the noise standard
deviation $\sigma > 0$. As mentioned before, this noise term $\sigma\,dB_t$ and corresponding error is due to intrinsic
neuronal noise (Faisal et al., 2008) (aggregated or localized) and possible interference between large
assemblies of neurons or circuits, and parallels the more general concept of measurement error in
networks of coupled dynamical systems.
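To make the single-circuit dynamics concrete, the SDE (1) can be simulated with a simple Euler-Maruyama discretization. This is a minimal sketch under stated assumptions: the discretization scheme, the toy data, the step size and the noise level are illustrative choices and are not taken from the paper.

```python
import numpy as np

def grad_E(w, x, y):
    """Gradient as written in the text: -sum_i (y_i - w x_i) x_i."""
    return -np.sum((y - w * x) * x)

def simulate_sde(x, y, w0=0.0, alpha=1.0, sigma=0.1, dt=1e-3, steps=20000, seed=0):
    """Euler-Maruyama path of dw = -tanh(alpha * grad_E(w)) dt + sigma dB."""
    rng = np.random.default_rng(seed)
    w = np.empty(steps + 1)
    w[0] = w0
    for k in range(steps):
        drift = -np.tanh(alpha * grad_E(w[k], x, y))
        w[k + 1] = w[k] + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return w

# Toy data (assumed): y = 2 x exactly, so the least-squares slope is 2.
x = np.array([0.6, 0.8])
y = 2.0 * x
w_star = np.dot(x, y) / np.dot(x, x)

path = simulate_sde(x, y)
print(f"least-squares slope w*: {w_star:.3f}")
print(f"time-averaged w over the last 5000 steps: {path[-5000:].mean():.3f}")
```

Far from the optimum the drift saturates at magnitude one, so the approach to the solution is at most linear in time; near the optimum the path fluctuates around the least-squares slope.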

2.2 Synchronization and Noise 

We now consider the effect of having $n$ independent copies of the neural system or pathway
implementing the dynamics (1), with associated parameters $\{w_1(t), \ldots, w_n(t)\}$. Since these dynamics
are nonlinear, the effect of the noise cannot be reduced by simply averaging over the independent
trajectories. However, if the circuit copies are coupled strongly enough they will attempt to
synchronize, and averaging over the copies becomes a potentially powerful way to reduce the effect of
the noise (Sherman and Rinzel, 1991; Needleman et al., 2001). The noise can be potentially large
(we do not make any small-noise assumptions), and will of course act to break the synchrony. We
will explore how well the noise can be reduced by synchronization and redundancy in the sections
that follow.

Given $n$ diffusively coupled copies of the noisy neural system, and setting $\alpha = 1$ in (1), we have
the following system of nonlinear SDEs:

$$dw_i(t) = -\tanh\Big(\sum_{j=1}^m \big(w_i(t)x_j - y_j\big)x_j\Big)dt + \sum_{j=1}^n W_{ij}(w_j - w_i)\,dt + \sigma\,dB_t^{(i)} \qquad (2)$$

for $i = 1, \ldots, n$, where $B_t^{(i)}$ are independent standard Wiener processes. The diffusive couplings
here should be interpreted as modeling abstract intercommunication between and among different
neural circuits, populations, or pathways. In such a general setting, diffusive coupling is a natural
and mathematically tractable choice that can capture the key, aggregate aspects of communication
among neural systems. Electrical connections such as those implemented by gap junctions in the
mammalian cortex (Fukuda et al., 2006; Bennett and Zukin, 2004) are also modeled well by
diffusive coupling terms when individual neurons are being discussed; however, we emphasize that
the system in Equation (2) is a conceptual model involving possibly large brain regions and do
not make assumptions at a level of biological detail that would invoke or require gap-junction type
connectivity.

^2 Put differently, in our setting the nonlinearity is not part of the learning problem, and so the saturated gradient
dynamics should not be viewed as the gradient of another error criterion.

Each copy of the basic neural circuit is corrupted by independent noise processes but follows
the same noise-free dynamics as the others, modulo initial conditions. In fact these coupled systems
may start from very different initial conditions. We will assume for simplicity uniform symmetric
weights $W_{ji} = W_{ij} = \kappa > 0$ when element $i$ is connected to element $j$. Defining $(w)_i$ to be the
(scalar) output of the $i$-th circuit, we can rewrite the system (2) in vector form as

$$dw(t) = -\Big(\tanh\Big(\sum_{i=1}^m \big(w(t)x_i - y_i\mathbf{1}\big)x_i\Big) + Lw(t)\Big)dt + \sigma\,dB_t \qquad (3)$$

where $L = \mathrm{diag}(W\mathbf{1}) - W$ is the network Laplacian, and $B_t$ is the standard $n$-dimensional Wiener
process. The spectrum of the network Laplacian captures important properties of the network's
topology, and will play a key role. Finally, the change of variable $X_t = w(t)\|x\|^2 - \langle x, y\rangle\mathbf{1}$, with
$(x)_i = x_i$, $(y)_i = y_i$, yields a system that will be easier to analyze:

$$dX_t = -\big(\tanh(X_t)\|x\|^2 + LX_t\big)dt + \tilde{\sigma}\,dB_t \qquad (4)$$

where we have defined $\tilde{\sigma} = \sigma\|x\|^2$. The unique globally stable equilibrium point for the
deterministic part of (4) is seen to be $X^* = 0$, which checks with the fact that the solution to the linear
regression problem is $w^* = \langle x, y\rangle / \langle x, x\rangle$ in this simple case.
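The change of variable can be checked numerically. The sketch below uses random data and a chain-graph Laplacian (both assumptions made purely for illustration) to confirm that $X = w\|x\|^2 - \langle x, y\rangle\mathbf{1}$ equals the argument of the tanh in the vector form, that $LX = \|x\|^2 Lw$, and that $X$ vanishes at the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, kappa = 5, 4, 0.7            # samples, circuit copies, coupling (illustrative)
x = rng.standard_normal(m)
y = rng.standard_normal(m)
w = rng.standard_normal(n)
ones = np.ones(n)

# Argument of tanh in the vector form: sum_i (w x_i - y_i 1) x_i
tanh_arg = sum((w * x[i] - y[i] * ones) * x[i] for i in range(m))

# Change of variable X = w ||x||^2 - <x, y> 1
X = w * np.dot(x, x) - np.dot(x, y) * ones

# Chain-graph Laplacian L = diag(W 1) - W with uniform weights kappa
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = kappa
L = np.diag(W.sum(axis=1)) - W

# At the least-squares solution w* = <x, y> / <x, x>, the new variable is zero.
w_star = np.dot(x, y) / np.dot(x, x)
X_at_opt = w_star * ones * np.dot(x, x) - np.dot(x, y) * ones

print("max |tanh_arg - X|      :", np.max(np.abs(tanh_arg - X)))
print("max |L X - ||x||^2 L w| :", np.max(np.abs(L @ X - np.dot(x, x) * (L @ w))))
print("max |X at w*|           :", np.max(np.abs(X_at_opt)))
```

The second identity holds because constant vectors lie in the null space of $L$, which is why the coupling term keeps the same form after the change of variable.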



2.3 Role of Network Topology 

The topology of a network of dynamical systems strongly influences synchronization, including
the rate at which elements synchronize and whether sync (or the tendency to sync) can occur at all
in the first place. Thus the pattern of interconnections among neural systems plays an important
role in controlling uncertainty by way of synchronization properties. In a network of stochastic
systems of the general (diffusive) type described in Section 2, topology can be seen to influence
the robustness of synchrony to noise through the spectrum of the network Laplacian. Laplacians
arising in various interesting networks and applications have received much attention, both in
biological decision making and in the context of synchronization of dynamical systems more
generally (Kopell and Ermentrout, 1986; Kopell, 2000; Jadbabaie et al., 2003; Wang and Slotine, 2005;
Taylor et al., 2009; Poulakakis et al., 2010).



We will consider four important network graphs here, and these arrangements will be helpful
examples to keep in mind when interpreting the results given in Section 3. The simplest graph
of coupled elements is perhaps the full, all-to-all graph. As one may guess, this network is also
the easiest to synchronize since each element can speak directly to the others. The spectrum of
the network Laplacian $\lambda(L)$ for this graph shows why it might be especially effective for reducing
uncertainty in the context of the coupled system of Section 2.2. With uniform coupling strength
$\kappa > 0$ and $n$ denoting the number of elements in the network, one can check that
$\lambda(L) = \{0, n\kappa, \ldots, n\kappa\}$. Denote by $\lambda_-$ the smallest non-zero (Fiedler) eigenvalue, and by
$\lambda_+$ the largest eigenvalue. Here $\lambda_- = \lambda_+ = n\kappa$ and it is these eigenvalues that control
synchronization for any given network. As we will show below in Theorem 3.1, the effect of the noise
can be reduced particularly quickly precisely because the non-zero eigenvalues depend on both
parameters, $\kappa$ and $n$.

Figure 1: Examples of (undirected) network graphs.

If fewer connections are made in the network it becomes harder to synchronize, and we move
away from the all-to-all ideal. Figure 1 shows some other common network graphs. The undirected
ring graph, appearing in the middle, has spectrum $\lambda_i(L) = 2\kappa\big[1 - \cos\big(\tfrac{2\pi}{n}(i - 1)\big)\big]$, $i = 1, \ldots, n$.
If the single edge connecting the first and last elements is removed to make a chain as shown on the
left in the Figure, the network becomes considerably harder to synchronize (Kopell and Ermentrout,
1986), although the spectrum of the chain looks similar: $\lambda_i(L) = 2\kappa\big[1 - \cos\big(\tfrac{\pi}{n}(i - 1)\big)\big]$. This
makes intuitive sense because information is constrained to flow through only one path, and with
possibly significant delays. Finally, the star graph shown on the right in the Figure has spectrum
$\lambda(L) = \{0, \kappa, \ldots, \kappa, n\kappa\}$, and we can see that the key Fiedler eigenvalue $\lambda_- = \kappa$ does not grow
with the size of the network $n$. The Theorems in Section 3 then predict that it will be impossible
to increase the synchronization rate simply by incorporating more copies of the neural circuit. The
coupling strength must also increase to make fluctuations from the common trajectory
(synchronization subspace) small. We will discuss this case in more detail. As might be particularly relevant
to brain anatomy, random graphs and directed graphs may also be considered, and have been studied
extensively (Bollobás, 2001).
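The spectra quoted for the four graphs can be verified numerically. The following sketch builds each weight matrix, forms $L = \mathrm{diag}(W\mathbf{1}) - W$, and compares the eigenvalues with the standard closed forms for the complete, cycle, path and star graphs; the helper names and the values of $n$ and $\kappa$ are illustrative assumptions.

```python
import numpy as np

def laplacian(adj):
    """Network Laplacian L = diag(W 1) - W for a symmetric weight matrix W."""
    return np.diag(adj.sum(axis=1)) - adj

def all_to_all(n, kappa):
    return kappa * (np.ones((n, n)) - np.eye(n))

def ring(n, kappa):
    adj = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        adj[i, j] = adj[j, i] = kappa
    return adj

def chain(n, kappa):
    adj = ring(n, kappa)
    adj[0, n - 1] = adj[n - 1, 0] = 0.0   # cut the edge joining first and last
    return adj

def star(n, kappa):
    adj = np.zeros((n, n))
    adj[0, 1:] = adj[1:, 0] = kappa       # node 0 is the hub
    return adj

def spectrum(adj):
    """Sorted Laplacian eigenvalues."""
    return np.sort(np.linalg.eigvalsh(laplacian(adj)))

n, kappa = 8, 0.5
i = np.arange(1, n + 1)
ring_formula = np.sort(2 * kappa * (1 - np.cos(2 * np.pi * (i - 1) / n)))
chain_formula = np.sort(2 * kappa * (1 - np.cos(np.pi * (i - 1) / n)))

for name, build in [("all-to-all", all_to_all), ("ring", ring),
                    ("chain", chain), ("star", star)]:
    spec = spectrum(build(n, kappa))
    print(f"{name:10s} Fiedler = {spec[1]:.4f}   largest = {spec[-1]:.4f}")
```

Only the all-to-all graph has a Fiedler eigenvalue that grows with $n$; for the star it stays at $\kappa$, and for the ring and chain it shrinks as $n$ grows.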

In neuroscience-related models, each connection in a network has an associated biophysical
cost in terms of energy and space requirements. All-to-all networks, with $n^2$ connections among $n$
circuits or neurons, are often criticized as being biologically unrealistic because of this cost. However,
it has been noted that all-to-all connectivity can be implemented with $2n$ connections using quorum
sensing ideas (Taylor et al., 2009), wherein a global average is computed and shared. The global
average is computed given inputs from all $n$ elements, and this average is sent back to each circuit
via another $n$ connections. The shared variable may be communicated by synapses, or sensed
chemically or electrically. Although quorum sensing cannot realize any set of $n^2$ connections, the
global average may be a weighted average or there may be several common variables organized
hierarchically. This allows for a rich set of networks with $O(n)$ connectivity which behave more
like networks with all-to-all connectivity for synchronization and stability purposes. Furthermore,
dynamics in the computation of the quorum variable itself, when appropriate for modeling purposes,
do not necessarily pose any special difficulty for establishing synchronization properties if virtual
systems are used (Russo and Slotine, 2010).
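The equivalence between uniform all-to-all diffusive coupling and a shared global average follows from $\sum_j \kappa(w_j - w_i) = n\kappa(\bar{w} - w_i)$. A minimal sketch of this identity is given below; the function names are hypothetical and the state vector is random.

```python
import numpy as np

def coupling_all_to_all(w, kappa):
    """Pairwise form: sum_j W_ij (w_j - w_i) with W_ij = kappa for i != j."""
    n = w.size
    W = kappa * (np.ones((n, n)) - np.eye(n))
    return W @ w - W.sum(axis=1) * w

def coupling_quorum(w, kappa):
    """Quorum form: every element senses one shared variable, the global mean."""
    return w.size * kappa * (w.mean() - w)

rng = np.random.default_rng(2)
w = rng.standard_normal(7)
pairwise = coupling_all_to_all(w, kappa=0.3)
quorum = coupling_quorum(w, kappa=0.3)
print("max difference between the two forms:", np.max(np.abs(pairwise - quorum)))
```

The quorum form needs one pass to compute the mean and one to distribute it, which is the $2n$-connection arrangement described above.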

The difficulty with which synchrony may be imposed can be "normalized" by the number of
connections in many cases to obtain a comparison between synchronization properties of various
graphs that takes biological cost into account. Using quorum variables where appropriate,
graphs whose spectra depend on $n$ are thus roughly comparable on equal biological terms.
Cost-normalized comparisons of synchronization properties are not always possible or meaningful,
however. Consider the ring and chain networks introduced above. There is a difference of one edge
between the two, but in the noise-free setting for example the chain requires asymptotically four
times more effort to synchronize than the ring architecture (see e.g. Wang and Slotine (2005),
Example 4.5).
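One way to see where a factor of four can come from is to compare the Fiedler eigenvalues of the two graphs. The sketch below assumes the standard closed forms for the cycle and path graph Laplacians with uniform weight $\kappa$, and reads "effort" loosely as the coupling needed to reach a given Fiedler eigenvalue; that reading is an interpretation, not a statement from the cited example.

```python
import numpy as np

def fiedler_ring(n, kappa=1.0):
    """Smallest non-zero Laplacian eigenvalue of the ring: 2k[1 - cos(2 pi / n)]."""
    return 2.0 * kappa * (1.0 - np.cos(2.0 * np.pi / n))

def fiedler_chain(n, kappa=1.0):
    """Smallest non-zero Laplacian eigenvalue of the chain: 2k[1 - cos(pi / n)]."""
    return 2.0 * kappa * (1.0 - np.cos(np.pi / n))

for n in (10, 100, 1000):
    ratio = fiedler_ring(n) / fiedler_chain(n)
    print(f"n = {n:5d}: ring / chain Fiedler ratio = {ratio:.6f}")
```

The ratio equals $4\cos^2(\pi/2n)$, which approaches 4 from below as $n$ grows.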



2.4 A Comment on Stability and Contraction 

We turn to analyzing the stability of the nonlinear system given by Equation (4). We will argue that
this is difficult for two reasons: the presence of noise, and the fact that the (noise-free) dynamics
saturate in magnitude. Indeed, without additional assumptions, one cannot in general show that the
system is globally exponentially stable. A common method for studying the stability properties of
a noiseless nonlinear dynamical system is via Lyapunov theory (Slotine and Li, 1991); however, in
the presence of noise system trajectories along the Lyapunov surface may not be strictly decreasing.
Contraction analysis (Lohmiller and Slotine, 1998; Wang and Slotine, 2005) is a differential
formalism related to Lyapunov exponents, and captures the notion that a system is stable in some
region if initial conditions or temporary disturbances are forgotten. If all neighboring trajectories
converge to each other, global exponential convergence to a single trajectory can be concluded:

Definition 2.1 (Contraction). Given the system equations $\dot{x} = f(x, t)$, a region of the state space
is called a contraction region if the Jacobian $J_f = \frac{\partial f}{\partial x}$ is uniformly negative definite in that region.
Furthermore, the contraction rate is given by $\beta > 0$, where $\frac{1}{2}(J_f + J_f^\top) \leq -\beta I < 0$.



An analogous definition in the case of stochastic dynamics has also been developed (Pham et al.,
2009), and requires contraction of the noise-free dynamics as well as a uniform upper bound on the
variance of the noise. However for the system (4), the Jacobian is found to be

$$J(w) = \|x\|^2\,\mathrm{diag}\big(\tanh^2(w) - 1\big) - L$$

so that $\lambda(J(w)) \leq -\lambda(L) \leq -\lambda_{\min}(L) = 0$. The subspace of constant vectors is a flow invariant
subspace, and $L$ does not contribute to the dynamics in this flow invariant space since $L$ has a
zero eigenvalue corresponding to its constant eigenvector. This difficulty can arise whenever one
considers diffusively coupled elements, and in such cases the usual way around this difficulty is to
work with an auxiliary or virtual system (as in e.g. (Pham and Slotine, 2007)) and study contraction
to the flow invariant subspace starting from initial conditions outside. However since $\tanh'(x) =
1 - \tanh^2(x)$, we still are left with the difficulty that the noise-free dynamics can have a convergence
rate to equilibrium arbitrarily close to zero as one travels far out to the tails of the tanh function; the
system is not necessarily contracting. Indeed, for any saturated dynamics, $\tanh(f(x, t))$, the rate
can be arbitrarily small. Thus one cannot easily determine the rate of convergence to equilibrium
using standard techniques. The analysis which we provide in the succeeding sections will attempt
to get around these difficulties by separately exploring the system's behavior in and out of the
flow-invariant (synchronization) subspace of constant vectors.
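The loss of a uniform contraction rate can be seen numerically by evaluating the Jacobian $J(w) = \|x\|^2\,\mathrm{diag}(\tanh^2(w) - 1) - L$ along the subspace of constant vectors. This is a minimal sketch assuming $\|x\|^2 = 1$ and an all-to-all Laplacian; both are illustrative choices.

```python
import numpy as np

def jacobian(w, L, x_norm_sq=1.0):
    """J(w) = ||x||^2 diag(tanh^2(w) - 1) - L."""
    return x_norm_sq * np.diag(np.tanh(w) ** 2 - 1.0) - L

def max_eig(c, n=5, kappa=1.0):
    """Largest eigenvalue of J at the constant state w = c * 1."""
    L = kappa * (n * np.eye(n) - np.ones((n, n)))   # all-to-all Laplacian
    return np.linalg.eigvalsh(jacobian(c * np.ones(n), L)).max()

for c in (0.0, 2.0, 5.0, 10.0):
    print(f"c = {c:4.1f}: largest eigenvalue of J = {max_eig(c):.3e}")
```

At $w = c\mathbf{1}$ the largest eigenvalue is $\tanh^2(c) - 1$, attained along the constant eigenvector that $L$ annihilates, so it approaches zero as $c$ grows regardless of the coupling strength.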



3 Controlling Uncertainty in Learning 

In this section we present and interpret the main results of the paper. The argument we put forward 
is that noisy, nonlinear trajectories can be linearly averaged to reduce the noise if fluctuations due 
to noise can be made small. We show that the fluctuations can be made small by coupling the 
dynamical systems, and that one can precisely control the size of the fluctuations. In particular, 
we give estimates which show that the tradeoff between noise and coupling strength among neural 
circuits determines the amount of uncertainty surrounding the decisions made by the neural system. 
Proofs of the Theorems are postponed until Section 6.

3.1 Preliminaries



We begin by decomposing the stochastic process $\{X_t \in \mathbb{R}^n\}_{t \geq 0}$ into a sum describing fluctuations
about the center of mass. Let $P = I - (1/n)\mathbf{1}\mathbf{1}^\top$, the canonical projection onto the zero-mean
subspace of $\mathbb{R}^n$, and define $Q = I - P$. Then for all $t \geq 0$, $X_t = PX_t + QX_t$. Clearly,
$\ker P = \mathrm{im}\, Q$ is the subspace of constant vectors. We will adopt the notation $\tilde{X}_t$ for $PX_t$, and $\bar{X}_t\mathbf{1}$
for $QX_t$ (along with the analogous notation $\tilde{w}_t$ and $\bar{w}_t$), and derive expressions for these quantities
based on Equation (4). The macroscopic variable $\bar{X}_t$ satisfies

$$d\bar{X}_t = \frac{1}{n}\mathbf{1}^\top dX_t = -\frac{\|x\|^2}{n}\mathbf{1}^\top \tanh(X_t)\,dt + \frac{\tilde{\sigma}}{\sqrt{n}}\,dB_t \qquad (5)$$

and thus

$$d\tilde{X}_t = dX_t - d\bar{X}_t\mathbf{1} = -\Big(\tanh(X_t)\|x\|^2 + LX_t - \frac{\|x\|^2}{n}\mathbf{1}^\top \tanh(X_t)\mathbf{1}\Big)dt + \tilde{\sigma}\,dB_t - \frac{\tilde{\sigma}}{\sqrt{n}}\,dB_t\mathbf{1}. \qquad (6)$$

In terms of the original variable $w$, the fluctuations $\tilde{w}_t$ are purely due to the noise, while $\bar{w}_t$
parameterizes the average decision function. As the decision function we consider is linear, the uncertainty
in the decisions is directly equivalent to uncertainty in the parameter $w$. We will study the evolution
of both the mean and the fluctuation processes over time; however, to assess uncertainty the central
quantity of interest will be the size of the ball containing the fluctuations (the "microscopic"
variables). We characterize the magnitude of the fluctuations via the squared norm process satisfying

$$\frac{d\|\tilde{X}_t\|^2}{2} = -\big(\|x\|^2\langle \tilde{X}_t, \tanh(X_t)\rangle + \langle \tilde{X}_t, L\tilde{X}_t\rangle\big)dt + \frac{1}{2}\tilde{\sigma}^2(n - 1)\,dt + \tilde{\sigma}\|\tilde{X}_t\|\,dB_t \qquad (7)$$

which follows from (6), applying Ito's Lemma to the function $h(\tilde{X}_t) = \frac{1}{2}\langle \tilde{X}_t, \tilde{X}_t\rangle$ and the fact that
$\langle \tilde{X}_t, dB_t\rangle = \|\tilde{X}_t\|\,dB_t$ in law.
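The properties of the projections used in this decomposition can be confirmed with a few lines of linear algebra. The sketch below assumes a chain-graph Laplacian and a random state vector, both chosen only for illustration; it also shows that $\mathrm{tr}(P) = n - 1$, which is the origin of the $(n - 1)$ factor in the Ito correction term of the squared norm process.

```python
import numpy as np

n, kappa = 6, 2.0
one = np.ones((n, 1))
P = np.eye(n) - (one @ one.T) / n      # projection onto the zero-mean subspace
Q = np.eye(n) - P                      # projection onto constant vectors

# Chain-graph Laplacian with uniform weights (illustrative choice)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = kappa
L = np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(3)
X = rng.standard_normal(n)
X_tilde = P @ X                        # fluctuations about the center of mass
X_bar = X.mean()                       # center of mass

print("P is idempotent       :", np.allclose(P @ P, P))
print("Q X equals X_bar * 1  :", np.allclose(Q @ X, X_bar * np.ones(n)))
print("P L equals L          :", np.allclose(P @ L, L))
print("trace of P            :", np.trace(P))
```

Because $PL = L$, projecting the dynamics onto the zero-mean subspace leaves the coupling term unchanged, while the drift and noise each lose their mean component.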

3.2 Uncertainty Estimates 

The first, and central, result says that the ball centered at $\bar{w}$ (the center of mass) containing the
fluctuations can be controlled in expectation from above and below by the coupling strength and in
most cases the number of circuit copies, via the spectrum of the network Laplacian $L$. We note that
lower bounds are typically ignored in the dynamical systems literature, possibly because they are
less important for stability analyses. We have found, however, that such bounds can be derived in
the case of saturated gradient dynamics, and that control from below can yield further insight into
the present problem of neural learning.

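Let $\lambda_+$ be the largest eigenvalue of $L$, and let $\lambda_-$ be the smallest non-zero eigenvalue of $L$.

Theorem 3.1 (Fluctuations can be made small). After transients of rate $2\lambda_-$,

$$\frac{(n - 1)\sigma^2}{2\lambda_+}\Big(1 - \frac{\|x\|^2}{\lambda_-}\Big) \leq \mathbb{E}\|\tilde{w}\|^2 \leq \frac{(n - 1)\sigma^2}{2\lambda_-}$$

where $\tilde{w} = Pw(t)$.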

Clearly the lower bound is informative only when $\lambda_- > \|x\|^2$. While we do not explicitly
assume any particular bound on the size of the examples $\|x\|$, it is reasonable that $\lambda_- \gg \|x\|^2$
since $\lambda_-$ can depend on the number of circuits $n$ and will always depend on the coupling strength $\kappa$,
which can be large. Large coupling strengths can be found in a variety of circumstances, particularly
in the case of motor control circuits (Grandhe et al., 1999; Kiemel et al., 2003) for example.
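A rough Monte Carlo check of the fluctuation bounds can be run on the coupled system in the original variable $w$. The sketch below rests on several assumptions: the bounds are taken to read $\frac{(n-1)\sigma^2}{2\lambda_+}(1 - \|x\|^2/\lambda_-) \leq \mathbb{E}\|\tilde{w}\|^2 \leq \frac{(n-1)\sigma^2}{2\lambda_-}$, the network is all-to-all (so $\lambda_- = \lambda_+ = n\kappa$), the toy data satisfy $\|x\|^2 = 1$, and an Euler-Maruyama discretization is used, which itself introduces a small bias.

```python
import numpy as np

def simulate_fluctuations(n=10, kappa=1.0, sigma=1.0, dt=1e-3,
                          burn_in=1000, steps=4000, replicas=200, seed=0):
    """Estimate E||w_tilde||^2 for the all-to-all coupled system; return it with the bounds."""
    rng = np.random.default_rng(seed)
    x = np.array([0.6, 0.8])                        # toy data with ||x||^2 = 1 (assumed)
    y = 2.0 * x
    x_sq, xy = x @ x, x @ y
    L = kappa * (n * np.eye(n) - np.ones((n, n)))   # all-to-all Laplacian
    w = np.full((replicas, n), xy / x_sq)           # every copy starts at w*
    total = 0.0
    for k in range(burn_in + steps):
        # Vector-form drift: -(tanh(||x||^2 w - <x, y> 1) + L w); L is symmetric.
        drift = -(np.tanh(x_sq * w - xy) + w @ L)
        w = w + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(w.shape)
        if k >= burn_in:
            w_tilde = w - w.mean(axis=1, keepdims=True)
            total += np.mean(np.sum(w_tilde ** 2, axis=1))
    lam = n * kappa                                 # lambda_minus = lambda_plus here
    upper = (n - 1) * sigma ** 2 / (2.0 * lam)
    lower = upper * (1.0 - x_sq / lam)
    return total / steps, lower, upper

estimate, lower, upper = simulate_fluctuations()
print(f"lower bound {lower:.4f} | empirical {estimate:.4f} | upper bound {upper:.4f}")
```

For the all-to-all graph the two bounds differ only by the factor $1 - \|x\|^2/(n\kappa)$, so increasing either $n$ or $\kappa$ squeezes the empirical value between them.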

In the next Theorem we give the variance of the fluctuations via a higher moment of $\|\tilde{w}\|$. This
result makes use of the lower bound in Theorem 3.1, and leads to a result that gives control of the
fluctuations in probability rather than in expectation.

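Theorem 3.2 (Variance of the trajectory distances to the center of mass). After transients of rate
$2\lambda_-$,

$$\mathrm{var}\big(\|\tilde{w}\|^2\big) \leq [\ldots]$$

[The right-hand side is not legible in the source. The surviving fragments are a term with numerator
involving $(n - 1)$ and a power of $\sigma$ over $2\lambda_-$, a factor involving $n$, and a term with numerator
involving $(n - 1)$ and a power of $\sigma$ over $2\lambda_+$, together with a factor involving $\lambda_-$.]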

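Chebyshev's inequality combined with Theorem 3.2 immediately gives the following Corollary.

Corollary 3.1. After transients of rate $2\lambda_-$,

$$\mathbb{P}\Big(\big|\|\tilde{w}(t)\|^2 - \mathbb{E}\|\tilde{w}(t)\|^2\big| \geq \varepsilon\Big) \leq \frac{\mathrm{var}\big(\|\tilde{w}(t)\|^2\big)}{\varepsilon^2} \qquad (8)$$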



Since any connected network graph has non-trivial eigenvalues which depend on the uniform
coupling strength $\kappa$, we see that for fixed $n$ as $\kappa \to \infty$, $\mathrm{var}(\|\tilde{w}(t)\|^2) \to 0$. In the case of the
all-to-all network topology, for example, the eigenvalues of $L$ depend on both $n$ and $\kappa$, so that
$\mathrm{var}(\|\tilde{w}(t)\|^2) \sim O(\kappa^{-2})$, giving a power law decay of order $O(\kappa^{-2}\varepsilon^{-2})$ on the right hand side of
Equation (8) in Corollary 3.1.

Finally, we turn to estimating in expectation the steady-state average distance between the
trajectories of the circuit copies and the noise-free solution. As we have argued in Section 2.4, the rate
of convergence to equilibrium of the trajectories $w_i(t)$ can be arbitrarily small. Although from the
Theorems above the fluctuations can be made small, one cannot in general make a similar statement
about the center of mass process $\bar{w}_t$ unless assumptions about the initial conditions are made (and
by extension, the same holds true for the trajectories $w_i(t)$). Such an assumption would lead to
control over the contribution of the tanh terms, and would establish a lower bound on the contraction
rate. Rather than make a specific assumption however, we state a general result: We again provide
a lower bound, this time following from the law of large numbers governing sums of i.i.d. Gaussian
random variables and the lower bound on the fluctuations provided by Theorem 3.1.

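Theorem 3.3 (Average distance to the noise-free trajectory). Denote by $w^*$ the minimizer of the
squared-error objective of Section 2.1. After transients of rate $2\lambda_-$,

$$[\ldots] \leq \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n \big(w_i(t) - w^*\big)^2\Big] \leq [\ldots]$$

[The bounds are not legible in the source. The surviving fragments on the lower-bound side are
$\sigma^2/n$ plus a term over $2n\lambda_+$ with a factor involving $\|x\|$ and $\lambda_-$; on the upper-bound side, a
term over $2\lambda_-$, a term involving $w^*$, and a $\max(0, \cdot)$ expression followed by a "where" clause
whose content is lost.]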

Theorem 3.3 says that the average closeness of the noisy system to the noise-free optimum
is controlled by the tradeoff between the noise and the coupling strength, and by the number of circuit
copies $n$. The former controls in large part the magnitude of the fluctuations, as discussed above.
The latter quantity is the unavoidable linear averaging component, and can be brought to zero only
as fast as the law of large numbers allows, $O(n^{-1/2})$ at best. For fixed $n$, as $\lambda_- \to \infty$ the upper and
lower bounds coincide since $E[(\bar{w}(t) - w^*)^2] \to \sigma^2/n$. As both $n \to \infty$ and $\kappa \to \infty$, Theorem 3.3
confirms that $w_i(t) \to w^*$ in expectation. If the fluctuations are not made small, however, linear
averaging will be wrong, and the error will of course be greater. Just how bad linear averaging is
when the fluctuations are allowed to be large is described in large part by the maximum curvature
of the noise-free dynamics (see footnote 3).
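
To make the tradeoff in Theorem 3.3 concrete, the following minimal sketch evaluates the two bounds numerically. It assumes all-to-all coupling with uniform strength $\kappa$, for which every nonzero eigenvalue of the graph Laplacian equals $n\kappa$, so that $\lambda_- = \lambda_+ = n\kappa$. The function and argument names are illustrative and do not come from the paper. The upper bound contains the term $E[(\bar{w}(t) - w^*)^2]$, which is not known in closed form and must be supplied by the caller, for example from simulation.

```python
def theorem_3_3_bounds(n, kappa, sigma, x_norm_sq=1.0, mean_sq_error=None):
    """Evaluate the lower (and optionally upper) bound of Theorem 3.3.

    Assumption: all-to-all coupling with uniform strength kappa, so the
    smallest and largest nonzero Laplacian eigenvalues are both n * kappa.
    mean_sq_error stands for E[(w_bar - w*)^2] and must be supplied by the
    caller if the upper bound is wanted.
    """
    lam_minus = lam_plus = n * kappa
    clipped = max(0.0, 1.0 - x_norm_sq / lam_minus)  # [.]_+ = max(0, .)
    lower = sigma**2 / n + (n - 1) * sigma**2 / (2 * n * lam_plus) * clipped
    upper = None
    if mean_sq_error is not None:
        upper = mean_sq_error + (n - 1) * sigma**2 / (2 * n * lam_minus)
    return lower, upper


# The fluctuation part of the lower bound shrinks as coupling grows,
# leaving only the averaging term sigma^2 / n.
for kappa in (1, 5, 50):
    lower, _ = theorem_3_3_bounds(n=20, kappa=kappa, sigma=10.0)
    print(f"kappa={kappa:3d}  lower bound={lower:.3f}")
```

Under these assumptions, the settings $n = 20$, $\sigma = 10$ with $\kappa = 1$ and $\kappa = 5$ give lower bounds of about 7.256 and 5.470.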

Finally, we note that the estimates above depend on the number of samples m only through 
the norm of the examples x, and it is reasonable to assume that this quantity may be appropriately 
normalized based on the maximum values conveyed by subsystems or rates of neurons comprising 
the circuit in the case of population or rate codes, or maximum field strengths in the case of LFPs. 
However, the requirement that the organism must collect $m$ observations before learning can
proceed is not essential. We may also consider the online learning setting, where data are observed
sequentially and updates to the parameters $(w_i)_i$ are made separately on the basis of each
observation in temporal order. The analysis above studies convergence to and distance from the solution in
the steady state, whatever that solution may be, given m pieces of evidence. Thus the online setting 
can also be considered as long as the time between observations is longer than the transient periods. 
Indeed, in many scenarios learning and decision making processes in the brain can take place on 
short time scales relative to the time scale on which experience is accumulated. In this case when 
another piece of information arrives, the system moves to a region defined (stochastically) around a 
new steady-state. A complication can arise when the new point arrives during the transient period 
of the previous learning process - before the system has had a chance to settle, on average, into the 
new equilibrium - however we do not attempt to model this situation here. 

4 Discussion 

The estimates given in Section 3 quantify the tradeoff between the degree of synchronization and the
noise (error), and the role this tradeoff plays in determining the uncertainty of a decision function 
learned by way of a stochastic, nonlinear dynamics. Estimates both in expectation and in probability 
were derived. We showed how and where both the coupling strength and the topology of the network 
of neural ensembles impact the extent to which the noise, and therefore uncertainty due to error, can 
be controlled. In particular, for most networks (see Section 2.3) the effect of the noise can be reduced
by either increasing the coupling strength or the number of redundant systems (or both), leading to a 
steady-state solution that is going to be closer to the ideal, error-free solution with high probability. 
From a technical standpoint, this is because fluctuations about the common trajectory are exactly the 
way in which the noise enters the picture; when the fluctuations are made small, the error is made 
small. In this way an organism may mitigate error imposed by a noisy, imperfect learning apparatus 
and solve a learning task to greater accuracy. Furthermore, synchronization and redundancy can 
both improve the speed of learning, in the sense that the rate of convergence to the steady state 
solution also depends on these mechanisms. Each of the bounds presented in Section 3 above holds
after transient terms of order $e^{-t\lambda_-}$ vanish, where $\lambda_-$ is the smallest non-zero eigenvalue of the
network Laplacian. For any stable connected network, strong coupling strengths directly improve
convergence rates to the steady-state, as seen by the dependence of $\lambda_-$ on $\kappa$. In the case of all-to-all
coupling (including approximately all-to-all and many random graphs), $\lambda_- = O(n\kappa)$ so that both increased
redundancy and sync will improve the speed of learning.
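
The dependence of $\lambda_-$ on topology and coupling strength can be checked numerically. The sketch below computes the smallest and largest nonzero Laplacian eigenvalues for an all-to-all network and, purely for contrast, a ring network. The ring is our own illustrative choice and is not analyzed in the text, and the helper name `laplacian_extremes` is hypothetical.

```python
import numpy as np


def laplacian_extremes(adjacency):
    """Return (lambda_minus, lambda_plus): the smallest and largest nonzero
    eigenvalues of the graph Laplacian L = D - A of a connected graph."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    eig = np.sort(np.linalg.eigvalsh(L))  # ascending; eig[0] is ~0
    return eig[1], eig[-1]


n, kappa = 20, 5.0

# All-to-all coupling with uniform strength kappa.
all_to_all = kappa * (np.ones((n, n)) - np.eye(n))

# Ring coupling (illustrative assumption): each node tied to two neighbors.
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = kappa

print("all-to-all:", laplacian_extremes(all_to_all))
print("ring:      ", laplacian_extremes(ring))
```

For all-to-all coupling both values equal $n\kappa$, while the sparsely connected ring has a much smaller $\lambda_-$ at the same $\kappa$, so its transients decay more slowly and its fluctuation bound is looser.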

Our overarching goal has been to explore quantitatively the role of redundancy and synchro- 
nization in reducing error incurred by a stochastic, non-ideal learning and decision making neural 
process. We have gone about this by considering a model which emphasizes the underlying computations
taking place rather than particular neural representations. Looking at the appropriate scale,
we seek to address the precise meaning of ensemble measurements and population codes, as well as
the information these codes convey about the underlying dynamics and signals. The results derived
above support the notion that synchronization and redundancy play a more functional role in the
context of learning processes in the brain, rather than being a mere epiphenomenon.

3 One way to see this is to take the first-order Taylor expansion of the dynamics with integral remainder. The
remainder term can be upper bounded by the spectral radius of the Hessian matrix, which is related to curvature (see e.g.
Tabareau et al., 2010).



4.1 Synchronization and Redundancy in the Brain 



Synchronization has been suggested, over a diverse history of experimental work, as a fundamental
mechanism for improvement in precision and reduction of uncertainty in the nervous system
(see e.g. Needleman et al., 2001; Enright, 1980). Redundancy too is an important and commonly
occurring mechanism. In retinal ganglion cells (Croner et al., 1993; Puchalla et al., 2005)
and heart cells (Clay and DeHaan, 1979) the spatial mean across coupled cells cancels out noise.
Populations of hair cells in otoliths perform redundant, collaborative computations to achieve
robustness (Kandel et al., 2000; Eliasmith and Anderson, 2004), and it has been suggested that multiple
cortical (amygdala-thalamus) loops contribute to fear response/conditioning, and emotion more
generally (LeDoux, 2000). With motor tasks such as reaching or standing, it has been argued that
planning and representation occur at least partially in redundant coordinate systems and involve
redundant degrees of freedom (Scholz and Schöner, 1999). Todorov (2008) maintains
that redundancy and noise combine to give rise to optimal muscle control policies, raising the
interesting possibility that in some cases the impact of the noise may need to be adjusted but not
necessarily eliminated altogether. On a more localized scale, reach direction has also been found to
be conveyed by populations of neurons with overlapping tuning curves (Georgopoulos et al., 1982)
where synchrony within such populations plays an important role (Grammont and Riehle, 1999).
Multiple sensorimotor transformations involving disparate brain regions may be at play in the
parietal cortex, where redundant sensory inputs from multiple modalities must be mapped into motor
responses (Ting, 2007; Pouget and Sejnowski, 1997). In the ascending auditory pathway, varying
degrees of redundancy have been noted, and contribute to the robust representation of frequency and
more complex auditory objects (Chechik et al., 2006). Ensemble measurements have also been
connected to behavior and have been suggested as inputs to brain-machine interfaces, while in stochastic
neural decision making it has been suggested that it is the collective behavior across multiple
populations of neurons that is responsible for perception and decision making, rather than activity of a
single neuron or population of neurons (Gigante et al., 2009).

In these examples and more generally, we suggest that redundancy plus feedback synchroniza- 
tion is a mechanism which may be used to improve the accuracy, robustness and speed of a learning 
process involving the relevant respective brain areas. This is separate from, and in contrast with,
redundancies which are harnessed to specifically increase storage capacity, as in the case of associative
memory models (Hertz et al., 1991). There, robustness to corruption is also achieved (via pattern
completion dynamics) but the degree of robustness must be traded off against capacity. The primary 
function of such populations of neurons is to ostensibly store and retrieve memory patterns rather 
than to implement adaptive, learning dynamics while eliminating noise. 

Another theme emerging from these instances of sync and redundancy is that key computations
may be seen as implemented by distant brain regions coupled together by way of long-distance
projections and network "hubs". Recent experimental observations in C. elegans cast this
interpretation in a developmental light (Varier and Kaiser, 2011), and suggest that such interactions
occur from an early stage in life and are important for normal development in even simple organisms.
Learning processes realized by such computations and interactions are certainly susceptible
to noise, and must cope with this noise one way or another. We suggest that synchronization and
redundancy are not only present and possible, but provide a ready, natural solution.

The ability to learn and make decisions reliably in the presence of uncertainty is of fundamental 
importance for survival of any organism. This uncertainty can be seen to arise from three distinct 
sources, and the approach discussed here treats only the first two: intrinsic neuronal noise, both 
local and in aggregate, and noise in the form of measurement error, under which we include error 
due to limitations in precision and nonlinearity in biological systems. A third and equally important
source of error is that of uncertainty in the inference process itself (Yang and Shadlen, 2007;
Kiani and Shadlen, 2009). This uncertainty is specific to and inherent in the decision problem and is
characterized by the posterior distribution over decisions given the experiential evidence. Our work 
only considers uncertainty beyond that of the inference process, and as such is one part of a larger 
puzzle. We argue that intrinsic noise is both experimentally and theoretically important - and in- 
volved enough technically - to be addressed in isolation, while holding all other variables constant. 
Indeed, intrinsic noise intensities can be large. The role of the network's topology and coupling 
mechanism also strongly influences the overall picture, often in surprising or subtle ways. But it 
is also possible that the methods recruited here can be applied towards understanding some aspect 
of the inference error if different inferences from the same observations can be made by different
"experts" (circuits), each with their own biases. Then averaging, nonlinearity and the uncertainty
could potentially be treated in a similar framework. 

4.2 Extensions and Generalizations 

Asymptotic stability of the stochastic system considered here is guaranteed as long as there is cou- 
pling. In general, if the dynamics of a stochastic system are contracting or can be made contracting 
with feedback, then combinations (e.g. parallel, serial, hierarchical) of such systems will be
contracting (Pham et al., 2009; Lohmiller and Slotine, 1998). In the present setting, the system
governing the fluctuations about the mean trajectory is contracting with a rate dependent on the coupling
strength and the noise variance. Thus combinations of learning systems of the general type con- 
sidered here can enjoy strong stability guarantees automatically, since the individual systems are 
contracting. 

Finally, we have assumed throughout that the errors affecting the collection of redundant neural 
circuits or systems are mutually independent. This is not an unreasonable modeling assumption: 
For large-scale learning processes involving different brain areas, noise imposed by local spike ir- 
regularities is largely unrelated to noise present in distant circuits. Within small populations of 
neurons, it is likely that dependence among intrinsic neuronal noise sources decays rapidly in space 
so that nearest-neighbors may experience somewhat correlated noise, but beyond this are not
significantly impacted by other members of the population. As noises in a biological environment
can never be fully dependent (whether due to thermal or chemical-kinetic factors, or otherwise),
partial dependence among noise inputs may be explicitly modeled as, for example, mixing
processes if desired (Doukhan, 1994). Estimates of the form discussed here would then be augmented
with mixing terms leading to results which make identical qualitative statements about the role of
redundancy and sync. Fluctuations, and the effect of the noise, would still be reducible but would
require larger coupling strengths or more redundancy compared to what would be necessary if the
noise sources were independent.

Figure 2: (Left) Typical simulated trajectories for coupled and uncoupled networks driven by the
same noise. (Right) Population average trajectories for the coupled and uncoupled systems.
Horizontal axes: time (s).

5 Simulations 

To empirically test the estimates given in Section 3 we simulated several systems of SDEs of the
form analyzed there using Euler-Maruyama integration (over time $t \in [0, 10\,\mathrm{s}]$, $10^5$ regularly spaced sample
points), for different settings of the parameters $n$ (number of circuits or elements), $\kappa$ (coupling
strength) and $\sigma$ (noise standard deviation). Initial conditions were randomly drawn from the uniform
distribution on $[-5, 5]$, and we fixed $\|x\|^2 = 1$ and the coupling arrangement to all-to-all coupling
with fixed strength determined by $\kappa$. For simplicity the simulated systems had equilibrium point
at zero, corresponding to $y = 0$, so that $\langle x, y \rangle = 0$ and $X^* = w^*$ (the change of variables is the
identity map and we can identify $X_t$ with $w_t$).
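
A simulation of this kind can be sketched as follows. The drift $-(\|x\|^2 \tanh(X_t) + L X_t)$ is our reading of the transformed dynamics as they appear in the proofs of Section 6, and should be checked against the model equations of the paper before reuse. Unlike the experiments reported here, which average over repeated runs, this sketch time-averages a single trajectory over a shorter horizon to keep the run time small. All names and default values are illustrative assumptions.

```python
import numpy as np


def simulate(n=20, kappa=5.0, sigma=10.0, x_norm_sq=1.0,
             t_end=5.0, dt=1e-4, burn_in=0.5, seed=0):
    """Euler-Maruyama integration of
        dX = -(|x|^2 tanh(X) + L X) dt + sigma dB
    with all-to-all coupling (assumed form of the dynamics). Returns the
    time average of ||X - mean(X)||^2 after discarding the burn-in period."""
    rng = np.random.default_rng(seed)
    L = kappa * (n * np.eye(n) - np.ones((n, n)))  # all-to-all Laplacian
    X = rng.uniform(-5.0, 5.0, size=n)             # initial conditions
    steps = int(round(t_end / dt))
    skip = int(round(burn_in / dt))
    total, count = 0.0, 0
    for step in range(steps):
        drift = -(x_norm_sq * np.tanh(X) + L @ X)
        X = X + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
        if step >= skip:
            fluct = X - X.mean()                   # fluctuations about the mean
            total += fluct @ fluct
            count += 1
    return total / count


avg = simulate()
upper = (20 - 1) * 10.0**2 / (2 * 20 * 5.0)  # (n - 1) sigma^2 / (2 lambda_-)
print(f"time-averaged fluctuation magnitude = {avg:.2f}, upper bound = {upper:.2f}")
```

The time average is itself a noisy estimate, so it can land slightly above or below the theoretical bound on any single run.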

For comparison purposes we first show on the left in Figure 2 typical simulated trajectories of
uncoupled (top) and coupled (bottom) populations when $n = 20$, $\kappa = 5$, $\sigma = 10$. Both populations
are driven by the same noise and the same set of initial conditions; however, each element is driven
by noise independent from the others as assumed above. From the units on the vertical axes, one can
see that coupling clearly reduces inter-trajectory fluctuations as expected. On the right in Figure 2
we show the coupled/uncoupled populations' respective center of mass trajectories for this particular
simulation instance. One can see from this figure that the average of the coupled system tends closer
to zero ($X^*$), and is less affected by large noise excursions.

To empirically test tightness of the estimates given in Section 3, we repeated simulations of each
respective system 5000 times, and averaged the relevant outcomes to approximate the expectations
appearing in the bounds. Transient periods were excluded in all cases. In Tables 1 through 5
we show the values predicted by the bounds and the corresponding simulated quantities, for each
respective triple of system parameter settings. Sample standard deviations of the simulated averages
(expectations) are given in parentheses. In Figure 3 we show theoretical versus simulated expected
magnitudes of the fluctuations $E\|\tilde{X}_t\|^2$ when $n = 200$ and $\sigma = 10$ over a range of coupling
strengths. The solid dark trace is the upper bound of Theorem 3.1 while the open circles are the
average simulated quantities (again 5000 separate simulations were run for each k). Error bars are 
also given for the simulated expectations. Note that the magnitude scale (y-axis) is logarithmic, so 
the error bars are also plotted on a log scale. We omitted the lower theoretical bound from the plot 
because it is too close to the upper bound to visualize well relative to the scale of the bounds. 
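
The predicted entries of Tables 1 through 5 can be recomputed with a few lines of code. The sketch below assumes all-to-all coupling, so that $\lambda_- = \lambda_+ = n\kappa$, and uses the fourth-moment estimate as we read it from the proof in Section 6.2. The function name is hypothetical.

```python
def fluctuation_bounds(n, kappa, sigma, x_norm_sq=1.0):
    """Predicted bounds on E||X~_t||^2 and var(||X~_t||^2), assuming
    all-to-all coupling so that lambda_- = lambda_+ = n * kappa."""
    lam = n * kappa
    upper = (n - 1) * sigma**2 / (2 * lam)            # upper bound on the mean
    lower = upper * max(0.0, 1.0 - x_norm_sq / lam)   # lower bound on the mean
    fourth_moment = upper**2 * (2 + 4 / (n - 1))      # bound on E||X~_t||^4
    var_upper = fourth_moment - lower**2
    return lower, upper, var_upper


settings = [(20, 5, 10), (20, 1, 5), (20, 1, 10), (100, 1, 10), (100, 5, 10)]
for n, kappa, sigma in settings:
    lo, up, var_up = fluctuation_bounds(n, kappa, sigma)
    print(f"n={n:3d} kappa={kappa} sigma={sigma:2d}: "
          f"lower={lo:.3f} upper={up:.3f} var_upper={var_up:.3f}")
```

The five parameter triples correspond to Tables 1 through 5 in order, and the computed values agree with the tabulated bounds to the printed precision.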

Generally, the estimates relating to the magnitude of the fluctuations are seen to be tight, while the
variance upper bound exceeds the simulated variance by a factor of roughly 10 to 50. For the experiments with large noise amplitudes,

Quantity                                     Lower Bound   Simulated               Upper Bound
$E\|\tilde{X}_t\|^2$                         9.405         9.497 (std = 3.1)       9.500
$\mathrm{var}(\|\tilde{X}_t\|^2)$            -             9.450 (std = 14.7)      111.046
$\frac{1}{n}E\|X_t - X^*\mathbf{1}\|^2$      5.470         12.249 (std = 22.2)     12.249 (std = 22.2)

Table 1: Estimates vs. simulated quantities: $n = 20$, $\kappa = 5$, $\sigma = 10$.



Quantity                                     Lower Bound   Simulated               Upper Bound
$E\|\tilde{X}_t\|^2$                         11.281        11.719 (std = 3.8)      11.875
$\mathrm{var}(\|\tilde{X}_t\|^2)$            -             14.261 (std = 23.0)     184.45
$\frac{1}{n}E\|X_t - X^*\mathbf{1}\|^2$      1.814         1.933 (std = 2.5)       1.946 (std = 2.4)

Table 2: Estimates vs. simulated quantities: $n = 20$, $\kappa = 1$, $\sigma = 5$.



Quantity                                     Lower Bound   Simulated               Upper Bound
$E\|\tilde{X}_t\|^2$                         45.125        47.053 (std = 15.2)     47.500
$\mathrm{var}(\|\tilde{X}_t\|^2)$            -             230.275 (std = 373.4)   2951.234
$\frac{1}{n}E\|X_t - X^*\mathbf{1}\|^2$      7.256         14.761 (std = 24.1)     14.784 (std = 24.1)

Table 3: Estimates vs. simulated quantities: $n = 20$, $\kappa = 1$, $\sigma = 10$.



Quantity                                     Lower Bound   Simulated               Upper Bound
$E\|\tilde{X}_t\|^2$                         49.005        49.556 (std = 7.0)      49.500
$\mathrm{var}(\|\tilde{X}_t\|^2)$            -             49.332 (std = 70.6)     2598
$\frac{1}{n}E\|X_t - X^*\mathbf{1}\|^2$      1.490         1.449 (std = 1.6)       1.449 (std = 1.6)

Table 4: Estimates vs. simulated quantities: $n = 100$, $\kappa = 1$, $\sigma = 10$.



Quantity                                     Lower Bound   Simulated               Upper Bound
$E\|\tilde{X}_t\|^2$                         9.880         10.137 (std = 1.5)      9.900
$\mathrm{var}(\|\tilde{X}_t\|^2)$            -             2.151 (std = 3.2)       102.362
$\frac{1}{n}E\|X_t - X^*\mathbf{1}\|^2$      1.099         1.496 (std = 1.5)       1.496 (std = 1.5)

Table 5: Estimates vs. simulated quantities: $n = 100$, $\kappa = 5$, $\sigma = 10$.

Figure 3: Simulated vs. theoretical upper bound estimates of the fluctuations' expected magnitude
over a range of coupling strengths $\kappa$. Here $n = 200$ circuits and $\sigma = 10$.

the empirical estimates can appear to slightly violate the bounds where the bounds are tight since
the variance across simulations is large. The lower bound estimating the distance of the center of
mass to the noise-free solution is also seen to be reasonably good. For comparison, we give the
upper estimate where the empirical distance is substituted in place of the expectation in order to
show closeness to the lower bound. Theorem 3.3 predicts that the upper and lower estimates will
eventually coincide if $\kappa$ and/or $n$ are chosen large enough.

6 Proofs 

In this section we provide proofs of the results discussed in Section 3.

We first introduce a key Lemma to be used in the development immediately below. 

Lemma 6.1. Let $P = I - (1/n)\mathbf{1}\mathbf{1}^T$, the canonical projection onto the zero mean subspace of $\mathbb{R}^n$.
Then for all $x \in \mathbb{R}^n$

$$0 \le \langle Px, \tanh(x) \rangle \le \|Px\|^2,$$

where the hyperbolic tangent applies elementwise.

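Proof. Given $x \in \mathbb{R}^n$, define the index sets $I = \{1, \ldots, n\}$, $I_+ = \{i \in I \mid (Px)_i \ge 0\}$, and
$I_- = I \setminus I_+$. Since $Px$ is zero mean, $\sum_{i \in I_+} (Px)_i = \sum_{i \in I_-} |(Px)_i|$. We will express the hyperbolic
tangent as $\tanh(z) = 2s(2z) - 1$, where $s(z) = (1 + e^{-z})^{-1}$ is the logistic sigmoid function. If we
let $\mu = \frac{1}{n}\mathbf{1}^T x$ be the center of mass of $x$, $(Px)_i = x_i - \mu \ge 0$ implies $s(2x_i) \ge s(2\mu)$ by monotonicity
of $s$. Likewise, $(Px)_i < 0$ implies $s(2x_i) < s(2\mu)$. Finally, note that since $P^2 = P$, $P$ is symmetric, and $\mathbf{1} \in \ker P$,
$\langle Px, \tanh(x) \rangle = \langle Px, P(2s(2x) - \mathbf{1}) \rangle = 2\langle Px, s(2x) \rangle$. Using these facts, we prove the lower
bound first:

$$\langle Px, \tanh(x) \rangle = 2\sum_{i \in I_+} (Px)_i\, s(2x_i) - 2\sum_{i \in I_-} |(Px)_i|\, s(2x_i)$$
$$\ge 2s(2\mu) \sum_{i \in I_+} (Px)_i - 2s(2\mu) \sum_{i \in I_-} |(Px)_i| = 2s(2\mu) \cdot 0 = 0.$$

Turning to the upper bound, we prove the statement $\langle Px, s(2x) - \tfrac{1}{2}x \rangle \le 0$, which is equivalent
because $\langle Px, \tanh(x) \rangle = 2\langle Px, s(2x) \rangle$ and $\|Px\|^2 = \langle Px, x \rangle$. First, if
$\mu = 0$, then $Px = x$ so $\langle Px, \tanh(x) \rangle = \langle x, \tanh(x) \rangle \le \|x\| \|\tanh(x)\| \le \|x\|^2 = \|Px\|^2$, since
$\|\tanh(x)\| \le \|x\|$ by virtue of the fact that $|\tanh(z)| = \tanh(|z|) \le |z|$ for any $z \in \mathbb{R}$. Now
suppose that $\mu > 0$. If $z \ge \mu > 0$, we can upper bound $s(2z)$ by the line tangent to the point
$(\mu, s(2\mu))$: $s(2z) \le mz + b$ with $m \le \tfrac{1}{2}$ and $b \ge \tfrac{1}{2}$. If $z < \mu$, we can take the lower bound
$s(2z) \ge \tfrac{1}{2}(z - \mu) + s(2\mu)$, which holds because the slope of $s(2z)$ never exceeds $\tfrac{1}{2}$. Using these estimates, we have that

$$\langle Px, s(2x) - \tfrac{1}{2}x \rangle = \sum_{i \in I_+} (Px)_i \big(s(2x_i) - \tfrac{1}{2}x_i\big) + \sum_{i \in I_-} |(Px)_i| \big(\tfrac{1}{2}x_i - s(2x_i)\big)$$
$$\le \sum_{i \in I_+} (Px)_i \big(b - (\tfrac{1}{2} - m)x_i\big) + \sum_{i \in I_-} |(Px)_i| \big(\tfrac{1}{2}\mu - s(2\mu)\big)$$
$$\le \sum_{i \in I_+} (Px)_i \big(b - (\tfrac{1}{2} - m)\mu\big) + \sum_{i \in I_-} |(Px)_i| \big(\tfrac{1}{2}\mu - s(2\mu)\big)$$
$$= \Big(\sum_{i \in I_+} (Px)_i\Big)\big(b + m\mu - s(2\mu)\big) = 0.$$

The second inequality follows from the fact that $(\tfrac{1}{2} - m) \ge 0$ and $x_i \ge \mu$ for $i \in I_+$.
Since $\sum_{i \in I_+} (Px)_i = \sum_{i \in I_-} |(Px)_i|$, and recalling that by definition $b$ satisfies $m\mu + b =
s(2\mu)$, the final equalities follow. If $\mu < 0$, then the proof is similar, taking the line tangent to the
point $(\mu, s(2\mu))$ as a lower bound for $s(2z)$ and the line $\tfrac{1}{2}(z - \mu) + s(2\mu)$ as an upper bound. □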
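
As a sanity check on Lemma 6.1 (not a substitute for the proof), the two-sided inequality can be tested on random vectors. The sampling distribution, dimension and tolerance below are arbitrary choices of ours.

```python
import numpy as np


def lemma_6_1_holds(x, tol=1e-12):
    """Check 0 <= <Px, tanh(x)> <= ||Px||^2 for one vector x, where
    P = I - (1/n) 1 1^T removes the mean. tol absorbs round-off."""
    x = np.asarray(x, dtype=float)
    px = x - x.mean()               # Px: zero-mean part of x
    inner = px @ np.tanh(x)         # <Px, tanh(x)>, tanh applied elementwise
    return -tol <= inner <= px @ px + tol


rng = np.random.default_rng(0)
samples = rng.normal(scale=3.0, size=(10_000, 8))
print(all(lemma_6_1_holds(x) for x in samples))
```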

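6.1 Fluctuations Estimates: Proof of Theorem 3.1

We begin by adding $\lambda\|\tilde{X}_t\|^2 dt$, with $\lambda \in (0, \infty)$, to both sides of the stochastic differential
equation for $\tfrac{1}{2}\|\tilde{X}_t\|^2$ to obtain

$$\tfrac{1}{2} d\|\tilde{X}_t\|^2 + \lambda\|\tilde{X}_t\|^2 dt = -\|x\|^2 \langle \tanh X_t, \tilde{X}_t \rangle dt + \big(\lambda\|\tilde{X}_t\|^2 - \langle L\tilde{X}_t, \tilde{X}_t \rangle\big) dt$$
$$\qquad + \tfrac{1}{2}(n-1)\sigma^2 dt + \sigma\|\tilde{X}_t\| dB_t = e^{-2\lambda t}\, d\big(\tfrac{1}{2}\|\tilde{X}_t\|^2 e^{2\lambda t}\big),$$

where the second equality follows on noticing that $d\big(\tfrac{1}{2}\|\tilde{X}_t\|^2 e^{2\lambda t}\big) = e^{2\lambda t}\big(\tfrac{1}{2} d\|\tilde{X}_t\|^2 + \lambda\|\tilde{X}_t\|^2 dt\big)$,
so that the last expression is the total Itô derivative form of the left hand side of the first equality. Now multiply
both sides by $e^{2\lambda t}$, switch to integral form, and multiply both sides by $e^{-2\lambda t}$ to arrive at

$$\tfrac{1}{2}\|\tilde{X}_t\|^2 = \tfrac{1}{2} e^{-2\lambda t}\|\tilde{X}_0\|^2 + \int_0^t e^{2\lambda(s-t)} \big(\tfrac{1}{2}(n-1)\sigma^2 - \|x\|^2 \langle \tanh X_s, \tilde{X}_s \rangle\big) ds$$
$$\qquad + \int_0^t e^{2\lambda(s-t)} \big(\lambda\|\tilde{X}_s\|^2 - \langle L\tilde{X}_s, \tilde{X}_s \rangle\big) ds + \sigma \int_0^t e^{2\lambda(s-t)} \|\tilde{X}_s\| dB_s. \qquad (9)$$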



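Upper Bound: Next, note that $\langle LX_t, \tilde{X}_t \rangle = \langle L\tilde{X}_t, \tilde{X}_t \rangle$ since $L(\bar{X}_t \mathbf{1}) = 0$, and that $\tilde{X}_t$ is by
definition orthogonal to any constant vector. For all $t$ we also have that

$$\lambda_-\|\tilde{X}_t\|^2 - \langle L\tilde{X}_t, \tilde{X}_t \rangle \le 0, \qquad -\langle \tanh X_t, \tilde{X}_t \rangle \le 0 \qquad (10)$$

almost surely. The first inequality follows from the fact that, since $\tilde{X}_t \in \operatorname{im} P$,

$$\lambda_-\|\tilde{X}_t\|^2 \le \langle L\tilde{X}_t, \tilde{X}_t \rangle \le \lambda_+\|\tilde{X}_t\|^2,$$

if $\lambda_-$ is the Fiedler eigenvalue of $L$ and $\lambda_+$ is the largest eigenvalue of $L$. The second inequality is
given by Lemma 6.1. Setting $\lambda = \lambda_-$ and applying the inequalities (10) to Equation (9) gives the
estimate

$$\tfrac{1}{2}\|\tilde{X}_t\|^2 \le \tfrac{1}{2} e^{-2\lambda_- t}\|\tilde{X}_0\|^2 + \frac{(n-1)\sigma^2}{2} \int_0^t e^{2\lambda_-(s-t)} ds + \sigma \int_0^t e^{2\lambda_-(s-t)} \|\tilde{X}_s\| dB_s$$
$$= \tfrac{1}{2} e^{-2\lambda_- t}\|\tilde{X}_0\|^2 + \frac{(n-1)\sigma^2}{4\lambda_-}\big(1 - e^{-2\lambda_- t}\big) + \sigma \int_0^t e^{2\lambda_-(s-t)} \|\tilde{X}_s\| dB_s \qquad (11)$$

almost surely. Taking expectations and noting that $E\big[\int_0^t e^{2\lambda_-(s-t)} \|\tilde{X}_s\| dB_s\big] = 0$, we have that

$$E\|\tilde{X}_t\|^2 \le \frac{(n-1)\sigma^2}{2\lambda_-} \qquad (12)$$

after transients of rate $2\lambda_-$.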

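Lower Bound: We show that $E\|\tilde{X}_t\|^2$ has a lower bound that can also be expressed in terms of the
coupling strength and the noise level. The derivation is similar to that of the upper bound, and we
begin with Equation (9). We set $\lambda = \lambda_+$ and apply the estimates $\lambda_+\|\tilde{X}_s\|^2 - \langle L\tilde{X}_s, \tilde{X}_s \rangle \ge 0$ and
$\langle \tanh X_s, \tilde{X}_s \rangle \le \|\tilde{X}_s\|^2$ for all $s$ a.s., yielding

$$\tfrac{1}{2}\|\tilde{X}_t\|^2 \ge \tfrac{1}{2} e^{-2\lambda_+ t}\|\tilde{X}_0\|^2 + \int_0^t e^{2\lambda_+(s-t)} \big(\tfrac{1}{2}(n-1)\sigma^2 - \|x\|^2\|\tilde{X}_s\|^2\big) ds + \sigma \int_0^t e^{2\lambda_+(s-t)} \|\tilde{X}_s\| dB_s.$$

Taking expectations, so that the Itô term vanishes, and integrating, we have

$$\tfrac{1}{2} E\|\tilde{X}_t\|^2 \ge \tfrac{1}{2} e^{-2\lambda_+ t} E\|\tilde{X}_0\|^2 + \frac{(n-1)\sigma^2}{4\lambda_+}\big(1 - e^{-2\lambda_+ t}\big) - \|x\|^2 \int_0^t e^{2\lambda_+(s-t)} E\|\tilde{X}_s\|^2 ds.$$

After transients of rate $2\lambda_-$, we can apply (12) to estimate the remaining integral and lower bound
the above equation by

$$\tfrac{1}{2} e^{-2\lambda_+ t} E\|\tilde{X}_0\|^2 + \frac{(n-1)\sigma^2}{4\lambda_+}\big(1 - e^{-2\lambda_+ t}\big) - \|x\|^2 \frac{(n-1)\sigma^2}{4\lambda_-\lambda_+}\big(1 - e^{-2\lambda_+ t}\big).$$

Since $\lambda_- \le \lambda_+$, transients of rate $2\lambda_+$ have already transpired if we suppose that we have waited
for transients of rate $2\lambda_-$. Therefore, we can say that after transients of rate $2\lambda_-$,

$$E\|\tilde{X}_t\|^2 \ge \frac{(n-1)\sigma^2}{2\lambda_+}\left(1 - \frac{\|x\|^2}{\lambda_-}\right). \qquad (13)$$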

6.1.1 Inverting the change of variables

Finally, we can obtain corresponding upper and lower bounds for the original system, noting that
since $\tilde{X}_t = P(w(t)\|x\|^2 - \langle x, y \rangle \mathbf{1}) = \|x\|^2 Pw(t)$, we have $E\|\tilde{w}\|^2 = E\|\tilde{X}_t\|^2 / \|x\|^4$, where we
have used the notation $\tilde{w}$ for $Pw$. The $\|x\|^4$ in the denominator then cancels with the same quantity
occurring in $\sigma^2$ in Equations (12) and (13), giving the final form shown in Theorem 3.1.

6.2 Fluctuations Estimates: Proof of Theorem 3.2

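We first derive the fourth moment of the norm of the fluctuations. Starting from Equation (11),
allow transients of rate $2\lambda_-$ to pass so that we are left with the integral inequality

$$\tfrac{1}{2}\|\tilde{X}_t\|^2 \le \frac{(n-1)\sigma^2}{4\lambda_-} + \sigma \int_0^t e^{2\lambda_-(s-t)} \|\tilde{X}_s\| dB_s.$$

Squaring both sides, we can apply the inequality $(a + b)^2 \le 2a^2 + 2b^2$ to obtain

$$\|\tilde{X}_t\|^4 \le \left(\frac{(n-1)\sigma^2}{\sqrt{2}\lambda_-}\right)^2 + 8\sigma^2 \left(\int_0^t e^{2\lambda_-(s-t)} \|\tilde{X}_s\| dB_s\right)^2.$$

Taking expectations and invoking Itô's isometry for the second term leads to

$$E\|\tilde{X}_t\|^4 \le \left(\frac{(n-1)\sigma^2}{\sqrt{2}\lambda_-}\right)^2 + 8\sigma^2 \int_0^t e^{4\lambda_-(s-t)} E\|\tilde{X}_s\|^2 ds$$
$$\le \left(\frac{(n-1)\sigma^2}{\sqrt{2}\lambda_-}\right)^2 + \frac{8\sigma^2}{4\lambda_-}\left(\frac{(n-1)\sigma^2}{2\lambda_-}\right) = \left(\frac{(n-1)\sigma^2}{2\lambda_-}\right)^2 \left(2 + \frac{4}{n-1}\right),$$

where the estimate (12) has been substituted in for $E\|\tilde{X}_s\|^2$. An upper bound on the variance
is then obtained from the identity $\mathrm{var}(Z^2) = E[Z^4] - (EZ^2)^2$ and the lower estimate given in
Equation (13). Reversing the change of variables as in Section 6.1.1 yields the final result.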
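
As a worked check of this variance estimate, consider the setting of Table 1 ($n = 20$, $\kappa = 5$, $\sigma = 10$, $\|x\|^2 = 1$), assuming all-to-all coupling so that $\lambda_- = \lambda_+ = n\kappa = 100$. The arithmetic below is ours and uses the fourth-moment bound together with the lower estimate (13):

```latex
\begin{align*}
\frac{(n-1)\sigma^2}{2\lambda_-} &= \frac{19 \cdot 100}{200} = 9.5,
  \qquad
  \frac{(n-1)\sigma^2}{2\lambda_+}\Big(1 - \frac{\|x\|^2}{\lambda_-}\Big) = 9.5 \cdot 0.99 = 9.405, \\
E\|\tilde{X}_t\|^4 &\le 9.5^2 \Big(2 + \frac{4}{19}\Big) = 90.25 \cdot \frac{42}{19} = 199.5, \\
\mathrm{var}\big(\|\tilde{X}_t\|^2\big) &\le 199.5 - 9.405^2 \approx 111.046 .
\end{align*}
```

This matches the variance upper bound listed in Table 1.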

6.3 Distance to the Noise-Free Trajectory: Proof of Theorem 3.3

Theorem 3.1 can be applied towards providing a lower bound for the average distance between the
noisy trajectories of the neural circuit and the noise-free solution to the learning problem. First
observe that from the orthogonal decomposition $X_t = PX_t + QX_t$ and the change of variables
mapping $w$ to $X_t$,

$$\|X_t\|^2 = \|\bar{X}_t \mathbf{1}\|^2 + \|\tilde{X}_t\|^2 = \|x\|^4 \|w - w^*\mathbf{1}\|^2. \qquad (14)$$

Furthermore, we have that

$$\bar{X}_t = n^{-1} \sum_i X_i(t) = n^{-1} \sum_i \big(w_i\|x\|^2 - \langle x, y \rangle\big),$$

so evidently $\|x\|^{-4} E\bar{X}_t^2 = E[(\bar{w}_t - w^*)^2]$. Next, note that if the fluctuations are small, the
trajectories $(w_i(t))_{i=1}^n$ are close to one another and the average trajectory $\bar{w}_t = n^{-1} w(t)^T \mathbf{1}$ evolves
essentially as $\bar{w}_t \approx w^* + \frac{\sigma}{\sqrt{n}} W_t$, where $W_t$ is interpreted as a white noise process. In this case we
then have that $E[(\bar{w}_t - w^*)^2] = \sigma^2/n$, and we see that $E[(\bar{w}_t - w^*)^2] \ge \sigma^2/n$ when the fluctuations are
not necessarily small. So we have that $\|x\|^{-4} E\bar{X}_t^2 \ge \sigma^2/n$. Combining the above with Theorem 3.1,

$$\frac{\sigma^2}{n} + \frac{(n-1)\sigma^2}{2n\lambda_+}\left[1 - \frac{\|x\|^2}{\lambda_-}\right]_+ \;\le\; \frac{E\|X_t\|^2}{n\|x\|^4} \;\le\; \frac{(n-1)\sigma^2}{2n\lambda_-} + E\big[(\bar{w}_t - w^*)^2\big],$$

with the notation $[\,\cdot\,]_+ = \max(0, \cdot)$. Equation (14) then shows that the middle quantity above is
equal to $E\big[\frac{1}{n}\sum_{i=1}^n (w_i(t) - w^*)^2\big]$.
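
The orthogonal decomposition behind Equation (14) is easy to confirm numerically: the squared norm of any vector splits into a center-of-mass part and a fluctuation part. The sketch below uses an arbitrary random vector as a stand-in for $X_t$; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = rng.normal(size=n)                 # stand-in for the state X_t

mean = X.mean()                        # center of mass (a scalar)
fluct = X - mean                       # P X: zero-mean fluctuations
total = X @ X                          # ||X||^2
split = n * mean**2 + fluct @ fluct    # ||mean * 1||^2 + ||P X||^2

print(np.isclose(total, split))
```

Dividing both sides by $n\|x\|^4$ gives the split into the averaging term and the fluctuation term that appears in the bounds of Theorem 3.3.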